https://cran.r-project.org/web/packages/timekit/index.html
http://archive.ics.uci.edu/ml/datasets/Online+Retail
This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
Attribute Information:
InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation. StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product. Description: Product (item) name. Nominal. Quantity: The quantities of each product (item) per transaction. Numeric.
InvoiceDate: Invice Date and time. Numeric, the day and time when each transaction was generated. UnitPrice: Unit price. Numeric, Product price per unit in sterling. CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer. Country: Country name. Nominal, the name of the country where each customer resides.
Relevant Papers:
The evolution of direct, data and digital marketing, Richard Webber, Journal of Direct, Data and Digital Marketing Practice (2013) 14, 291–309. Clustering Experiments on Big Transaction Data for Market Segmentation, Ashishkumar Singh, Grace Rumantir, Annie South, Blair Bethwaite, Proceedings of the 2014 International Conference on Big Data Science and Computing. A decision-making framework for precision marketing, Zhen You, Yain-Whar Si, Defu Zhang, XiangXiang Zeng, Stephen C.H. Leung c, Tao Li, Expert Systems with Applications, 42 (2015) 3357–3367.
Citation Request:
Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).
R for Data Science Garrett Grolemund Hadley Wickham
library(tidyverse)
library(tidyquant)
library(modelr)
options(na.action = na.warn)
library(timekit)
library(broom)
library(gridExtra)
library(grid)
retail <- read_csv("OnlineRetail.csv",
col_types = cols(
InvoiceNo = col_character(),
StockCode = col_character(),
Description = col_character(),
Quantity = col_integer(),
InvoiceDate = col_datetime("%m/%d/%Y %H:%M"),
UnitPrice = col_double(),
CustomerID = col_integer(),
Country = col_character()
)) %>%
mutate(day = parse_date(format(InvoiceDate, "%Y-%m-%d")),
day_of_week = wday(day, label = TRUE),
time = parse_time(format(InvoiceDate, "%H:%M")),
month = format(InvoiceDate, "%m"),
income = Quantity * UnitPrice,
income_return = ifelse(Quantity >= 0, "income", "return"))
p1 <- retail %>%
filter(Country == "United Kingdom") %>%
ggplot(aes(x = Country, fill = income_return)) +
geom_bar() +
scale_fill_manual(values = palette_light()) +
theme_tq() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
guides(fill = FALSE)
p2 <- retail %>%
filter(Country != "United Kingdom") %>%
ggplot(aes(x = Country, fill = income_return)) +
geom_bar() +
scale_fill_manual(values = palette_light()) +
theme_tq() +
theme(legend.position = "right") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
grid.arrange(p1, p2, widths = c(0.2, 0.8))
retail %>%
ggplot(aes(x = day, color = income_return)) +
geom_freqpoly(bins = 100, size = 1, alpha = 0.6) +
scale_color_manual(values = palette_light()) +
theme_tq() +
labs(title = "Number of purchases/returns per day")
retail %>%
ggplot(aes(x = day, y = ..density.., color = income_return)) +
geom_freqpoly(size = 1, alpha = 0.7, bins = 100) +
scale_color_manual(values = palette_light()) +
theme_tq() +
labs(title = "Density of purchases/returns per day")
retail %>%
group_by(day, income_return) %>%
summarise(sum_income = sum(income)) %>%
ggplot(aes(x = day, y = sum_income, color = income_return)) +
geom_ref_line(h = 0, colour = "grey") +
geom_line(size = 1, alpha = 0.6) +
scale_color_manual(values = palette_light()) +
theme_tq() +
labs(title = "Income from purchases/returns per day")
retail %>%
group_by(time, income_return) %>%
summarise(sum_income = sum(income)) %>%
ggplot(aes(x = time, y = sum_income, color = income_return)) +
geom_ref_line(h = 0, colour = "grey") +
geom_line(size = 1, alpha = 0.6) +
scale_color_manual(values = palette_light()) +
theme_tq() +
labs(title = "Income from purchases/returns per time of day")
retail %>%
ggplot(aes(x = time, y = day)) +
stat_bin2d(alpha = 0.8, bins = 25, color = "white") +
scale_fill_gradientn(colours = c(palette_light()[[1]], palette_light()[[2]])) +
theme_tq() +
theme(legend.position = "right") +
labs(title = "Purchases/returns per day and time")
retail %>%
mutate(day2 = format(InvoiceDate, "%d")) %>%
group_by(month, day2) %>%
summarise(sum_income = sum(income)) %>%
ggplot(aes(x = month, y = day2, fill = sum_income)) +
geom_tile(alpha = 0.8, color = "white") +
scale_fill_gradientn(colours = c(palette_light()[[1]], palette_light()[[2]])) +
theme_tq() +
theme(legend.position = "right") +
labs(title = "Net income per month and day")
range(retail$day)
retail<- retail %>%
mutate(model = ifelse(day <= "2011-11-01", "train", "test"))
rep_customer <- retail %>%
group_by(day, CustomerID) %>%
summarise(n = n()) %>%
mutate(repeat_customer = ifelse(duplicated(CustomerID), TRUE, FALSE))
length(which(rep_customer$repeat_customer == TRUE))
country_purch <- retail %>%
mutate(Country2 = ifelse(Country == "United Kingdom", "uk", "other_country")) %>%
group_by(day, Country2) %>%
summarise(n = n()) %>%
spread(key = Country2, value = n)
n <- retail %>%
group_by(day, income_return) %>%
summarise(n = n()) %>%
spread(key = income_return, value = n)
retail_p_day <- retail %>%
group_by(day) %>%
summarise(sum_income = sum(income),
mean_income = mean(income),
sum_income_purch = sum(income[income >= 0]),
sum_loss_return = sum(income[income < 0]),
mean_income_purch = mean(income[income >= 0]),
mean_loss_return = mean(income[income < 0]),
sum_quantity = sum(Quantity),
mean_quantity = mean(Quantity),
sum_quantity_purch = sum(Quantity[Quantity >= 0]),
sum_loss_return = sum(Quantity[Quantity < 0]),
mean_quantity_purch = mean(Quantity[Quantity >= 0]),
mean_loss_return = mean(Quantity[Quantity < 0]),
mean_unit_price = mean(UnitPrice)) %>%
mutate(diff_sum_income = sum_income - lag(sum_income)) %>%
left_join(n, by = "day") %>%
left_join(country_purch, by = "day") %>%
left_join(distinct(select(retail, day, day_of_week, month, model)), by = "day") %>%
mutate(season = ifelse(month %in% c("03", "04", "05"), "spring",
ifelse(month %in% c("06", "07", "08"), "summer",
ifelse(month %in% c("09", "10", "11"), "fall", "winter"))))
retail_p_day %>%
ggplot(aes(x = day_of_week, y = sum_income)) +
geom_boxplot(alpha = 0.8, fill = palette_light()[[1]]) +
theme_tq()
Understanding the long-term trend is challenging because there’s a day-of-week effect (lower values on Sundays) that dominates the subtler patterns.
retail_p_day %>%
ggplot(aes(x = day, y = sum_income)) +
geom_line(alpha = 0.5) +
geom_point(aes(color = day_of_week), size = 2) +
scale_color_manual(values = palette_light()) +
theme_tq() +
xlim(c(as.Date("2011-10-01", format = "%Y-%m-%d"),
as.Date("2011-11-01", format = "%Y-%m-%d")))
retail_p_day %>%
gather(x, y, diff_sum_income, sum_income, mean_income, sum_quantity, mean_quantity) %>%
ggplot(aes(x = day, y = y, color = model)) +
facet_wrap(~ x, ncol = 1, scales = "free") +
geom_point(alpha = 0.5) +
geom_line(alpha = 0.5) +
scale_color_manual(values = palette_light()) +
theme_tq()
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).
retail_p_day %>%
gather(x, y, sum_income, mean_income, sum_quantity, mean_quantity) %>%
ggplot(aes(x = y, fill = model, color = model)) +
facet_wrap(~ x, ncol = 2, scales = "free") +
geom_density(alpha = 0.6) +
scale_fill_manual(values = palette_light()) +
scale_color_manual(values = palette_light()) +
theme_tq()
retail_p_day %>%
select(sum_income, mean_unit_price, income, return, other_country, uk, model) %>%
gather(x, y, mean_unit_price:uk) %>%
ggplot(aes(x = y, y = sum_income, color = model)) +
facet_wrap(~ x, scales = "free") +
geom_point(alpha = 0.5) +
scale_color_manual(values = palette_light()) +
theme_tq()
The goal of a model is to provide a simple low-dimensional summary of a dataset. In the context of this book we’re going to use models to partition data into patterns and residuals. Strong patterns will hide subtler trends, so we’ll use models to help peel back layers of structure as we explore a dataset.
However, before we can start using models on interesting, real, datasets, you need to understand the basics of how models work. For that reason, this chapter of the book is unique because it uses only simulated datasets. These datasets are very simple, and not at all interesting, but they will help you understand the essence of modelling before you apply the same techniques to real data in the next chapter.
There are two parts to a model:
First, you define a family of models that express a precise, but generic, pattern that you want to capture. For example, the pattern might be a straight line, or a quadatric curve. You will express the model family as an equation like y = a_1 * x + a_2 or y = a_1 * x ^ a_2. Here, x and y are known variables from your data, and a_1 and a_2 are parameters that can vary to capture different patterns.
Next, you generate a fitted model by finding the model from the family that is the closest to your data. This takes the generic model family and makes it specific, like y = 3 * x + 7 or y = 9 * x ^ 2.
It’s important to understand that a fitted model is just the closest model from a family of models. That implies that you have the “best” model (according to some criteria); it doesn’t imply that you have a good model and it certainly doesn’t imply that the model is “true”. George Box puts this well in his famous aphorism:
All models are wrong, but some are useful.
It’s worth reading the fuller context of the quote:
Now it would be very remarkable if any system existing in the real world could be exactly represented by any simple model. However, cunningly chosen parsimonious models often do provide remarkably useful approximations. For example, the law PV = RT relating pressure P, volume V and temperature T of an “ideal” gas via a constant R is not exactly true for any real gas, but it frequently provides a useful approximation and furthermore its structure is informative since it springs from a physical view of the behavior of gas molecules.
For such a model there is no need to ask the question “Is the model true?”. If “truth” is to be the “whole truth” the answer must be “No”. The only question of interest is “Is the model illuminating and useful?”.
The goal of a model is not to uncover truth, but to discover a simple approximation that is still useful.
We will take advantage of the fact that you can think about a model partitioning your data into pattern and residuals. We’ll find patterns with visualisation, then make them concrete and precise with a model. We’ll then repeat the process, but replace the old response variable with the residuals from the model. The goal is to transition from implicit knowledge in the data and your head to explicit knowledge in a quantitative model. This makes it easier to apply to new domains, and easier for others to use.
For very large and complex datasets this will be a lot of work. There are certainly alternative approaches - a more machine learning approach is simply to focus on the predictive ability of the model. These approaches tend to produce black boxes: the model does a really good job at generating predictions, but you don’t know why. This is a totally reasonable approach, but it does make it hard to apply your real world knowledge to the model. That, in turn, makes it difficult to assess whether or not the model will continue to work in the long-term, as fundamentals change. For most real models, I’d expect you to use some combination of this approach and a more classic automated approach.
train <- filter(retail_p_day, model == "train") %>%
select(day, sum_income, day_of_week, mean_unit_price, income, return)
cor(select(train, -day, -day_of_week), use = "na.or.complete") %>%
as.data.frame() %>%
tibble::rownames_to_column() %>%
gather(x, y, sum_income:income) %>%
ggplot(aes(x = x, y = rowname, fill = y)) +
geom_tile(alpha = 0.8, color = "white") +
scale_fill_gradientn(colours = c(palette_light()[[1]], palette_light()[[2]])) +
theme_tq() +
theme(legend.position = "right")
One way to remove this day of the week pattern is to use a model. First, we fit the model, and display its predictions overlaid on the original data:
lm_dow <- lm(sum_income ~ day_of_week, data = select(train, -day))
pred <- train %>%
add_predictions(lm_dow) %>%
add_residuals(lm_dow)
pred %>%
ggplot(aes(x = day_of_week, y = sum_income)) +
geom_boxplot(alpha = 0.8, fill = palette_light()[[1]]) +
geom_point(aes(y = pred), color = palette_light()[[2]], size = 3) +
theme_tq()
Note the change in the y-axis: now we are seeing the deviation from the expected number of flights, given the day of week. This plot is useful because now that we’ve removed much of the large day-of-week effect, we can see some of the subtler patterns that remain:
Our model seems to fail starting in June: you can still see a strong regular pattern that our model hasn’t captured. Drawing a plot with one line for each day of the week makes the cause easier to see:
test <- filter(retail_p_day, model == "test") %>%
add_predictions(lm_dow) %>%
add_residuals(lm_dow)
rbind(select(pred, day, sum_income, pred),
select(test, day, sum_income, pred)) %>%
rename(original = sum_income) %>%
gather(x, y, original, pred) %>%
ggplot(aes(x = day, y = y, color = x)) +
geom_point(alpha = 0.5) +
geom_line(alpha = 0.5) +
scale_color_manual(values = palette_light()) +
theme_tq()
rbind(select(pred, day, day_of_week, resid),
select(test, day,day_of_week, resid)) %>%
ggplot(aes(x = day, y = resid)) +
geom_ref_line(h = 0, colour = "grey") +
geom_line(aes(color = day_of_week), alpha = 0.7) +
scale_color_manual(values = palette_light()) +
geom_smooth(method = "loess") +
theme_tq()
Our model fails to accurately predict the number of flights on Saturday: during summer there are more flights than we expect, and during Fall there are fewer. We’ll see how we can do better to capture this pattern in the next section.
There are fewer flights in January (and December), and more in summer (May-Sep). We can’t do much with this pattern quantitatively, because we only have a single year of data. But we can use our domain knowledge to brainstorm potential explanations.
What drives the income? Which variable is the strongest predictor?
The flip-side of predictions are residuals. The predictions tells you the pattern that the model has captured, and the residuals tell you what the model has missed. The residuals are just the distances between the observed and predicted values that we computed above.
lm_train_income <- lm(sum_income ~ ., data = select(train, -day))
test <- filter(retail_p_day, model == "test") %>%
add_predictions(lm_train_income) %>%
add_residuals(lm_train_income)
how far away are the predictions from the observed values? Note that the average of the residual will always be 0. if it looks like random noise, suggesting that our model has done a good job of capturing the patterns in the dataset.
ggplot(test, aes(sum_income, resid)) +
geom_ref_line(h = 0, colour = "grey") +
geom_point(alpha = 0.5) +
scale_color_manual(values = palette_light()) +
theme_tq()
The residuals show that the model has clearly missed some pattern in sum_income
test %>%
gather(x, y, sum_income, pred) %>%
ggplot(aes(x = day, y = y, color = x)) +
geom_point(alpha = 0.5) +
geom_line(alpha = 0.5) +
scale_color_manual(values = palette_light()) +
theme_tq()
lm_train_income_int <- lm(sum_income ~ mean_unit_price * income * return, data = select(train, -day))
lm_train_income_int_step <- step(lm_train_income_int, direction = "backward", trace = 0)
#tidy(lm_train_income_int_step)
test <- test %>%
add_predictions(lm_train_income_int_step, "pred_lm_int_step") %>%
add_residuals(lm_train_income_int_step, "resid_lm_int_step")
ggplot(test, aes(sum_income, resid_lm_int_step)) +
geom_ref_line(h = 0, colour = "grey") +
geom_point(alpha = 0.5) +
scale_color_manual(values = palette_light()) +
theme_tq()
Maybe lag is a better respone to predict?
Try ns()
Try MASS::rlm
train <- droplevels(train)
rlm_fit <- MASS::rlm(sum_income ~ day_of_week * poly(mean_unit_price, 5) + income + return, data = select(train, -day))
## Warning: Dropping 2 rows with missing values
## Warning in rlm.default(x, y, weights, method = method, wt.method =
## wt.method, : 'rlm' failed to converge in 20 steps
test <- test %>%
add_predictions(rlm_fit, "pred_rlm") %>%
add_residuals(rlm_fit, "resid_rlm")
ggplot(test, aes(sum_income, resid_rlm)) +
geom_ref_line(h = 0, colour = "grey") +
geom_point(alpha = 0.5) +
scale_color_manual(values = palette_light()) +
theme_tq()
test %>%
gather(x, y, sum_income, pred_rlm) %>%
ggplot(aes(x = day, y = y, color = x)) +
geom_point(alpha = 0.5) +
geom_line(alpha = 0.5) +
scale_color_manual(values = palette_light()) +
theme_tq()
broom::augment(rlm_fit, data = train) %>%
gather(x, y, .fitted:.hat) %>%
ggplot(aes(x = day)) +
facet_wrap(~ x, scales = "free") +
geom_line(aes(y = sum_income), color = palette_light()[[1]]) +
geom_line(aes(y = y), color = palette_light()[[2]]) +
theme_tq()
One way is to use the same approach as in the last chapter: there’s a strong signal (overall linear growth) that makes it hard to see subtler trends. We’ll tease these factors apart by fitting a model with a linear trend. The model captures steady growth over time, and the residuals will show what’s left.
Instead of repeating an action for each variable, we want to repeat an action for each country, a subset of rows. To do that, we need a new data structure: the nested data frame. To create a nested data frame we start with a grouped data frame, and “nest” it:
This creates an data frame that has one row per group (per country), and a rather unusual column: data. data is a list of data frames (or tibbles, to be precise). This seems like a crazy idea: we have a data frame with a column that is a list of other data frames! I’ll explain shortly why I think this is a good idea.
The data column is a little tricky to look at because it’s a moderately complicated list, and we’re still working on good tools to explore these objects. Unfortunately using str() is not recommended as it will often produce very long output. But if you pluck out a single element from the data column you’ll see that it contains all the data for that country (in this case, Afghanistan).
dow_model <- function(df) {
lm(sum_income ~ mean_unit_price * income + return, data = df)
}
by_dof <- train %>%
group_by(day_of_week) %>%
nest() %>%
mutate(model = purrr::map(data, dow_model),
preds = purrr::map2(data, model, add_predictions),
resids = purrr::map2(data, model, add_residuals))
## Warning: Dropping 2 rows with missing values
unnest(by_dof, resids) %>%
ggplot(aes(x = sum_income, y = resid)) +
facet_wrap(~day_of_week, scales = "free") +
geom_ref_line(h = 0, colour = "grey") +
geom_point(aes(color = day_of_week), alpha = 0.5) +
geom_line(aes(color = day_of_week), alpha = 0.8) +
scale_color_manual(values = palette_light()) +
geom_smooth(se = FALSE, method = "lm") +
theme_tq()
It looks like we’ve missed some mild patterns. There’s also something interesting going on in Africa: we see some very large residuals which suggests our model isn’t fitting so well there. We’ll explore that more in the next section, attacking it from a slightly different angle.
Instead of looking at the residuals from the model, we could look at some general measurements of model quality. You learned how to compute some specific measures in the previous chapter. Here we’ll show a different approach using the broom package. The broom package provides a general set of functions to turn models into tidy data. Here we’ll use broom::glance() to extract some model quality metrics. If we apply it to a model, we get a data frame with a single row:
by_dof %>%
mutate(glance = map(model, broom::glance)) %>%
unnest(glance) %>%
gather(x, y, r.squared:df.residual) %>%
ggplot(aes(x = day_of_week, y = y)) +
facet_wrap(~ x, scales = "free", ncol = 4) +
geom_bar(stat = "identity", fill = palette_light()[[1]], alpha = 0.8) +
theme_tq()
The Tuesday model is particularly bad. What’s going on there?
train %>%
ggplot(aes(x = day, y = sum_income)) +
facet_wrap(~ day_of_week) +
geom_point(alpha = 0.5) +
geom_line(alpha = 0.5) +
scale_color_manual(values = palette_light()) +
theme_tq()
Tuesday has a few very strong outliers.